Overview of Data Sets

We are analyzing the quality of wine based on 11 predictors: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol. Quality, in this instance, is defined as an indicator of its craftsmanship and, thus, desirability which can be used, for example, in pricing. Quality does not necessarily indicate if a wine has gone bad. We will use two data sets – one for white wine and the other for red – to create two respective machine learning models. There are 4899 observations in the white wine data set and 1600 observations in the red wine data set.


Overview of Research Questions

We are interested in conducting analysis on both the white and red wine to assess the quality of wine in new wines in production so that they may be accurately priced and marketed.


Loading Data

This project uses data on the white and red wine, which records information of the chemical makeup of the wine.

white_og <- read.csv("Wine Dataset/winequality-white.csv", sep = ";")
red_og <- read.csv("Wine Dataset/winequality-red.csv", sep = ";")


Data Cleaning

To clean our data, we clean the column names, change quality into a factor so we can analyze it with classification models, and add a type of White or Red to each data set. We removed any rows in the white wine data set with a quality value of 9 because there are too few instances, and thus, it inhibits our models from performing correctly later on. The red wine data set did not have any quality values of 9. We also create a combined data frame with both values from white and red wine to see if there are significant differences between red and white evaluations for quality.

Data Split

In our data split, we put a proportion of .7 of each original data set into a training data set and a proportion of .3 into the testing data sets, stratifying by quality. In this section, we also folded our data into 5 folds for later cross validation use.

set.seed(1234)
white_split <- white %>% 
  initial_split(prop = 0.7, strata = "quality")

white_train <- training(white_split)
white_test <- testing(white_split)

red_split <- red %>% 
  initial_split(prop = 0.7, strata = "quality")

red_train <- training(red_split)
red_test <- testing(red_split)

white_fold <- vfold_cv(white_train, v = 5)
red_fold <- vfold_cv(red_train, v = 5)

Exploratory Data Analysis

What sort of factors do winemakers and sommeliers look for in a quality wine? Generally, quality is determined by acidity, dryness, flavor profile or taste, alcohol content, and how well the wine is preserved or how it changes as it is stored. In our exploratory data analysis, we will analyze our predictors based on these five categories. First, acidity levels can be summarized through the ph levels, fixed.acidity, volatile.acidity, and citric.acid content. Dryness is determined by the density. Taste can be broken down into sweetness and saltiness, which are caused by residual.sugar and chlorides respectively. We will analyze alcohol content singularly to see its effect on the wine quality. Lastly, sulfurous compounds are what is generally used to preserve wine, so we will analyze free.sulfur.dioxide, total.sulfur.dioxide, and sulphates to see if the way a wine is preserved interacts with wine quality in an interesting way.

Our data can be split into two data sets because experts look for different levels of acidity, sugar, etc. for white wine and red wine. Thus, we will have 3 different representations of the data: one for white wine, one for red wine, and one for both.

All of our predictors are continuous, so we will use box plots, histograms, and scatter plots to visualize our data and determine feature selection.

First, let’s see the distribution of quality between both data sets of wine.

ggplot(combinedWine_og, aes(quality)) + geom_bar(color = "black", fill = "pink") + labs(title = "Histogram of Quality - Total Wine") + xlab("Quality of Wine") + ylab("Count") 

We can see that it is normally distributed, meaning that most wine has a quality value of 5 or 6, with few exceptionally good wines having a value of 8 or 9, and low quality wines having a quality value of 3. Based on their low frequency, we can further justify selecting against of quality values of 9 in our initial data cleaning.

Next, we look at the correlation matrices for white and red wine separately to determine which predictors are correlated.

Correlation Matrices

White

white %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color', main = 'White Wine Correlation Plot')

Red

red %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color', main = 'Red Wine Correlation Plot')

In the white wine correlation matrix, density and residual sugar; and density and alcohol are the predictors with the highest correlation. Total sulfur dioxide and free sulfur dioxide also have a moderate correlation.

In the red wine correlation matrix, citric acid and fixed acidity; density and fixed acidity; citric acid and volatile acidity; pH and fixed acidity; and free sulfur dioxide and total sulfur dioxide are highly correlated with each other.

Scatter plots

To visualize and validate these correlations, let’s take a look at the scaled scatter plot of each predictor plotted against its correlated counterpart.

White

# scaled data sets 
scaled_white = as.data.frame(scale(select(white, c(-quality,-type))))
scaled_red = as.data.frame(scale(select(red, c(-quality,-type))))

# scaled white residual sugar versus density 
ggplot(scaled_white, aes(x = residual.sugar, y = density)) + geom_point()+scale_x_continuous(name = "Residual Sugar") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle(" Residual Sugar Versus Density") + theme(plot.title = element_text(size = 5))

# scaled white alcohol versus density 
ggplot(scaled_white, aes(x = alcohol, y = density)) + geom_point()+scale_x_continuous(name = "Alcohol") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Alcohol Versus Density") + theme(plot.title = element_text(size = 5))

# scaled white free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_white, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 5))

Red

# scaled red volatile acidity versus citric acid
ggplot(scaled_red, aes(x = volatile.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Volatile Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Volatile Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 5))

# scaled red fixed acidity versus citric acid 
ggplot(scaled_red, aes(x = fixed.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 5))

# scaled red fixed acidity versus pH
ggplot(scaled_red, aes(x = fixed.acidity, y = pH)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "pH") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus pH") + theme(plot.title = element_text(size = 5))

# scaled red fixed acidity versus density
ggplot(scaled_red, aes(x = fixed.acidity, y = density)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Density") + theme(plot.title = element_text(size = 5))

# scaled red free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_red, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Dioxide Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 5))

Based on the scatter plots, we can visualize the correlations between the predictors. For example, for white wine, density has a strong positive correlation with residual sugar and a moderate negative correlation with alcohol. Through these scatter plots, we confirm the existence of correlations predicted by our initial correlation matrix.

Now, we can take a look at the box plots for several of our predictors to see the ways that they interact with wine quality, isolated from the other predictors. First, we will visualize acidity levels which can be measured through fixed acidity, volatile acidity, and citric acid levels. As shown above in the scatter plots, these three predictors are highly correlated with each other in red wine.

#fixed acidity
ggplot(combinedWine, mapping = aes(x = `fixed.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Fixed Acidity Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) +coord_cartesian( xlim = c(0,16), ylim = NULL, default = FALSE )

# volatile acidity 
ggplot(combinedWine, mapping = aes(x = `volatile.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Volatile Acidity Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )

# citric acid 
ggplot(combinedWine, mapping = aes(x = `citric.acid`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Citric Acid Levels Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )

From these box plots, we can see that fixed acidity levels are relatively consistent in both red and white wine. Volatile acidity has a negative correlation with quality in red wine, but relatively consistent averages for each level of wine quality in white wine. Citric acid levels in red wine have a stronger positive correlation than in white wine. In general, we can see that acidity levels fluctuate more in red wine than in white wine.

Next, let’s take a look at dryness which is determined by the predictor density. Based on the correlation matrix and scatter plots, density also is correlated with residual sugar and alcohol in white wine and with fixed acidity in red wine.

# density
ggplot(combinedWine, mapping = aes(x = density, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = density, y = quality)) + labs(title = "Red and White Density Levels versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(.985,1.01), ylim = NULL, default = FALSE )

Although density is correlated with several predictors according to the correlation matrices and scatter plots, in this box plot, we can see that density stays relatively consistent, around 1, for each level of wine quality.

Next, we will look at the taste of the wine, which is determined by levels of sweetness and saltiness. These are affected by sugar levels and chlorides respectively.

#sugar content
ggplot(combinedWine, mapping = aes(x = residual.sugar, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = residual.sugar, y = quality)) + labs(title = "Red and White Residual Sugar Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,25), ylim = NULL, default = FALSE )

# we removed outliers to ensure that the variation was not due to the outliers 
ggplot(red[red$free.sulfur.dioxide < 50,], aes(x = free.sulfur.dioxide, y = quality)) + 
  geom_boxplot(aes(fill = quality)) +
  labs(title = "Free Sulfur Dioxide for Red Wine", x = "Free Sulfur Dioxide", y = "Quality") +
  geom_point(width = 0.15) +
  scale_fill_brewer(palette = "RdPu")

#chlorides
ggplot(combinedWine, mapping = aes(x = chlorides, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = chlorides, y = quality)) + labs(title = "Red and White Chloride Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,0.3), ylim = NULL, default = FALSE )

White wine has, on average, higher and more variable sugar levels than red wine while red wine has an on average higher chloride content than white wine. There seem to be a higher number of outliers in the values of chloride.

Next, we will analyze alcohol content, which can affect the taste of the wine as well.

ggplot(combinedWine, mapping = aes(x = alcohol, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Alcohol Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) 

There is generally higher alcohol content associated with a wine of higher quality but there is not a significant different in averages between red wine and white wine.

Lastly, let’s look at the preservative content, which is determined by free sulfur dioxide, total sulfur dioxide, and sulfates.

#free sulfur dioxide
ggplot(combinedWine, mapping = aes(x = free.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = free.sulfur.dioxide, y = quality)) + labs(title = "Red and White Free Sulfur Dioxide Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,150), ylim = NULL, default = FALSE )

#total sulfur dioxide
ggplot(combinedWine, mapping = aes(x = total.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = total.sulfur.dioxide, y = quality)) + labs(title = "Red and White Total Sulfur Dioxide Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,300), ylim = NULL, default = FALSE )

#sulfates
ggplot(combinedWine, mapping = aes(x = sulphates, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = sulphates, y = quality)) + labs(title = "Red and White Sulfate Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1.5), ylim = NULL, default = FALSE )

From the box plots we can see that red wine generally has a lower sulfur dioxide content than white wine. Also, averages across each stratification of quality have similar values except for sulfates in red wine, which have a slight positive correlation with quality.

CONCLUDING SENTENCE; CONNECT TO HOW KNOWING THESE THINGS AFFECTS OUR MODELLING


Model Fitting

We will be fitting linear discriminant analysis, naive Bayes, single decision tree, random forest, and boosted tree models and compare accuracy metrics. Then, we will fit the three models with the best roc_auc to our testing data. First, let’s see how the models perform on the white wine data set.

white_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())

red_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())

We did all of our modeling in R-Scripts for efficiency purposes, since models generally take a long time to run. We will load all of results here, and intermittently call variables throughout the report to visualize our calculations.

# LDA 
load("WhiteLDA.rda")
load("RedLDA.rda")

# LDA with PCA
load("WhiteLDAPCA.rda")
load("RedLDAPCA.rda")

# Decision Tree
load("WhiteWineDecisionTree.rda")
load("RedWineDecisionTree.rda")

# Random Forest 
load("WhiteWineRandomForest.rda")
load("RedWineRandomForest.rda")

# Boosted Trees
load("WhiteWineBoostedTrees.rda")
load("RedWineBoostedTrees.rda")

Let’s first explore linear discriminant analysis and naive Bayes classification through k-fold cross validation.

#set up model with mode classification and engine MASS
wlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

#add model and recipe to the workflow
wlda_wkflow<- workflow() %>% 
  add_model(wlda_model) %>% 
  add_recipe(white_recipe)

#create a fit between the workflow and folded data
wlda_fit_cross <- fit_resamples(wlda_wkflow, white_fold)

#determine the roc_auc of the LDA model on the folded training data
collect_metrics(wlda_fit_cross)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.528     5 0.00640 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.735     5 0.0159  Preprocessor1_Model1

TRANSITION FROM LDA TO BAYES

#set up model with mode classification and engine kLaR
#we used set_args(use_kernel = FALSE) based on Lab 3
wnb_mod <- naive_Bayes() %>% 
  set_mode("classification") %>% 
  set_engine("klaR") %>% 
  set_args(usekernel = FALSE) 

#add model and recipe to the workflow
wnb_wkflow <- workflow() %>% 
  add_model(wnb_mod) %>% 
  add_recipe(white_recipe)

#create a fit between the workflow and folded data
wnb_fit_cross <- fit_resamples(wnb_wkflow, white_fold)

#determine the roc_auc of the Naive Bayes model on the folded training data
collect_metrics(wnb_fit_cross)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.451     5 0.00984 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.734     5 0.0200  Preprocessor1_Model1

Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model. Thus, we will use linear discriminant analysis on the white wine data set. Now, let’s explore how the model performs on our testing data.

add pca lda here

#fit the workflow to the training data
wlda_fit <- fit(wlda_wkflow, white_train)

#predict on the testing data, selecting for quality as the classification response
wlda_res <- predict(wlda_fit, new_data = white_test %>%  select(-quality), type = "class" ) 

#add a column for the actual value of quality next to the predicted value
wlda_res <- bind_cols(wlda_res, white_test %>%  select(quality)) 
# returning the accuracy, roc_auc, roc_auc curves, heatmap 
wlda_pred <- augment(wlda_fit, new_data = white_test) 
wlda_acc <- wlda_pred %>% accuracy(truth = quality, estimate = .pred_class)
wlda_rocauc <- wlda_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White LDA Model")
wlda_roccurve <- wlda_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wlda_confusionmatrix <- wlda_pred %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

print(wlda_acc)
print(wlda_rocauc)
wlda_roccurve
wlda_confusionmatrix

Through the linear discriminant analysis model, we get an accuracy metric of .537 meaning that our model is moderately accurate. Now, let’s try several tree methods to see if the produce more accurate results on the training data set of white wine. We will first look at the model for a single decision tree.

# decision tree specification
wtree_spec <- decision_tree() %>%
  set_engine("rpart")

wtree_spec_class <- wtree_spec %>%
  set_mode("classification")
wclass_tree_fit <- wtree_spec_class %>%
  fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)
wclass_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
augment(wclass_tree_fit, new_data = white_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

augment(wclass_tree_fit, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)
# tuning cost complexity 
wclass_tree_wf<- workflow() %>%
  add_model(wtree_spec_class %>% 
              set_args(cost_complexity = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)

tune_res_white <- tune_grid(
  wclass_tree_wf,
  resamples = white_fold,
  grid = param_grid,
  metric = metric_set(accuracy)
)
#autoplot(tune_res_white)
wAutoPlot

# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_white)

wclass_tree_final <- finalize_workflow(wclass_tree_wf, best_complexity)

wclass_tree_final_fit <- fit(wclass_tree_final, data = white_train)
wclass_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
# augmented on training 
wdectree_pred <- augment(wclass_tree_final_fit, new_data = white_train) 
wdectree_acc <- wdectree_pred %>% accuracy(truth = quality, estimate = .pred_class)
wdectree_rocauc <- wdectree_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Decision Tree Model")
wdectree_roccurve <- wdectree_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wdectree_confusionmatrix <- augment(wclass_tree_final_fit, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
print(wdectree_acc)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.647
print(wdectree_rocauc)
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.818 White Decision Tree Model
wdectree_roccurve

wdectree_confusionmatrix

# JUST LOAD IN INFO WITH THIS RATHER THAN RERUNNING 
load(file = "WhiteWineDecisionTree.rda")

decision tree analysis

Now let’s look at the random forest model.

#load saved data
load(file = "WhiteWineRandomForest.rda")

# setting random forest model up
wrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")

wrandfor_wf <- workflow() %>%
  add_model(wrandfor %>%
              set_args(mtry = tune(), trees = tune(), min_n = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

# tuning the model to find the best arguments 
param_grid1 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)

wtune_res_randfor <- tune_grid(
  wrandfor_wf,
  resamples = white_fold,
  grid = param_grid1,
  metric = metric_set(accuracy)
)

wAutoPlotRF <- autoplot(wtune_res_randfor)

# collecting metrics to find best mean
wbest_rocauc1 <- collect_metrics(wtune_res_randfor) %>% arrange(desc(mean))
wbest_metric1 <- select_best(wtune_res_randfor)

wrandfor_final <- rand_forest(mtry = 2, trees = 17, min_n = 30) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
wrandfor_fit_final <- fit(wrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)

# variance importance plot
wVIP <- vip(wrandfor_fit_final)
wVIP

# extracting the metrics 
wrandfor_pred <- augment(wrandfor_fit_final, new_data = white_train) 
wrandfor_acc <- wrandfor_pred %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc <- wrandfor_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve <- wrandfor_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix <- augment(wrandfor_fit_final, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

print(wrandfor_acc)
print(wrandfor_rocauc)
wrandfor_roccurve
wrandfor_confusionmatrix

random forest analysis

Now, let’s look at the boosted tree model.

#load saved data
load(file = "WhiteWineBoostedTrees.rda")

wboost_spec <-  boost_tree(tree_depth = 5) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

wboost_wf <- workflow() %>%
  add_model(wboost_spec %>%
              set_args(trees = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

param_grid3 <- grid_regular(trees(range = c(10,2000)), levels = 10)

wtune_res_boosted <- tune_grid(
  wboost_wf,
  resamples = white_fold,
  grid = param_grid3
)

wBoostedAutoPlot <- autoplot(wtune_res_boosted)
wBoostedAutoPlot

wbest_rocauc2 <- collect_metrics(wtune_res_boosted) %>% arrange(desc(mean))
wbest_metric2 <- select_best(wtune_res_boosted)
print(wbest_metric2)

wboost_final <- boost_tree(tree_depth = 5, trees = 231)%>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

wboost_fit_final <- fit(wboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)

# augmenting 
wpredicted <- augment(wboost_fit_final, new_data = white_train) 
wboosted_acc <- wpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc <-  wpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurve <- wpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrix <- wpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

wpredicted
wboosted_acc
wboosted_rocauc
wBoostedROCCurve
wBoostedConfusionMatrix
# augment all the rocs from decision, randforest, boosted --> see which one has the best --> fit best on testing 
# WHITE
wbest_rocauc
wbest_rocauc1
wbest_rocauc2
wbest_roc_table <- rbind(wbest_rocauc[1,c(2,4)], wbest_rocauc1[1,c(4,6)], wbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
wbest_roc_table

# We will test our model using Boosted Trees and Random Forest 

explain the best ones of the tree type models also mention between lda and naive bayes, lda was better, but we are gonna try using pca first

white_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())

# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(white_recipe_pca$template))

# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)

wlda_pca_wkflow <- workflow() %>% 
  add_model(wlda_model) %>% 
  add_recipe(white_recipe_pca)

pca_lda_fit <- wlda_pca_wkflow %>%
  tune_grid(resamples = white_fold,
            grid = tuneGrid,
            metrics = metric_set(accuracy),
            control = trControl)

pca_lda_metrics <- pca_lda_fit %>% collect_metrics()

ggplot(pca_lda_metrics, aes(x = num_comp, y = mean)) +
  geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
  scale_x_continuous(breaks = 1:ncol(white_recipe_pca$template)) +
  facet_wrap(~.metric) +
  theme_bw()

pca_lda_fit %>% show_best(metric = "accuracy")

(bestTune <- pca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))

wlda_pca_wkflow_final <- wlda_pca_wkflow %>% finalize_workflow(bestTune)

fit_final <- wlda_pca_wkflow_final %>% fit(white_train)

white.PCALDA <- tibble(white_train,
       predict(fit_final, new_data = white_train, type = "class"), # predicted class
       predict(fit_final, new_data = white_train, type = "prob"), # posterior prob. for classes
       as_tibble(predict(fit_final, new_data = white_train, type = "raw")$x)) # LD scores
white.PCALDA

# plot
ggplot(white.PCALDA, aes(x = LD1, y = LD2)) +
  geom_point(aes(color = quality, shape = .pred_class)) + 
  theme_bw() +
  ggtitle("PCA-LDA (DAPC) on White Wine Training dataset, using 9 PC")

#augmented on training 
pcalda_fit <- augment(fit_final, new_data = white_train) 
pcalda_acc <- pcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_rocauc <- pcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
pcalda_roccurve <- pcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_confusionmatrix <- pcalda_fit %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

pcalda_acc
pcalda_rocauc
pcalda_roccurve
pcalda_confusionmatrix
# lda using pca testing
pcalda_fit_test <- augment(fit_final, new_data = white_test) 
pcalda_test_acc <- pcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_test_rocauc <- pcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA")
pcalda_test_roccurve <- pcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_test_confusionmatrix <- pcalda_fit_test %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

pcalda_test_acc
pcalda_test_rocauc
pcalda_test_roccurve
pcalda_test_confusionmatrix

# random forest testing
wrandfor_pred_test <- augment(wrandfor_fit_final, new_data = white_test) 
wrandfor_acc_test <- wrandfor_pred_test %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc_test <- wrandfor_pred_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve_test <- wrandfor_pred_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix_test <- augment(wrandfor_fit_final, new_data = white_test) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

print(wrandfor_acc_test)
print(wrandfor_rocauc_test)
wrandfor_roccurve_test
wrandfor_confusionmatrix_test

# boosted trees testing 
wpredictedtest <- augment(wboost_fit_final, new_data = white_test) 
wboosted_acc_test <- wpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc_test <- wpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurveTesting <- wpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrixTesting <- wpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

wpredictedtest
wboosted_acc_test
wboosted_rocauc_test
wBoostedROCCurveTesting
wBoostedConfusionMatrixTesting

white conclusion

Next, let’s see how the models perform on the red wine data set. We will be using the same type of models that we used in white wine in order to keep things consistent.

#lda model using cross validation
rlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

rlda_wkflow<- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe)

rlda_fit_cross <- fit_resamples(rlda_wkflow, red_fold)

collect_metrics(rlda_fit_cross)

#naive bayes model using cross validation
rnb_mod <- naive_Bayes() %>% 
  set_mode("classification") %>% 
  set_engine("klaR") %>% 
  set_args(usekernel = FALSE) 

rnb_wkflow <- workflow() %>% 
  add_model(rnb_mod) %>% 
  add_recipe(red_recipe)

rnb_fit_cross <- fit_resamples(rnb_wkflow, red_fold)

collect_metrics(rnb_fit_cross)

Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model. Thus, we will also use linear discriminant analysis on the red wine data set.

# fitting the model
rlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

# creating the workflow
rlda_wkflow <- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe)

# fitting the model to training data
rlda_fit <- fit(rlda_wkflow, red_train)

# now using that fit to test the model on the testing data
rlda_res <- predict(rlda_fit, new_data = red_test %>%  select(-quality), type = "class" ) 
rlda_res <- bind_cols(rlda_res, red_test %>%  select(quality)) 

# returning our predictions and visualizations 
rlda_pred <- augment(rlda_fit, new_data = red_test)
rlda_acc <- rlda_pred %>% accuracy(truth = quality, estimate = .pred_class)
rlda_rocauc <- rlda_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8)%>% mutate(model_type = "Red LDA Model")
rlda_roccurve <- rlda_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rlda_confusionmatrix <- rlda_pred %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

# predictions, accuracy, roc auc, roc curve, confusion matrix 
print(rlda_acc)
print(rlda_res)
print(rlda_rocauc)
rlda_roccurve
rlda_confusionmatrix

decision tree transition

# decision tree specification
rtree_spec <- decision_tree() %>%
  set_engine("rpart")

# setting mode to classification
rtree_spec_class <- rtree_spec %>%
  set_mode("classification")

rclass_tree_fit <- rtree_spec_class %>%
  fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)


rclass_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
augment(rclass_tree_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

augment(rclass_tree_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)

# augmented on testing 
augment(rclass_tree_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) 

augment(rclass_tree_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

# tuning cost complexity 
rclass_tree_wf<- workflow() %>%
  add_model(rtree_spec_class %>% 
              set_args(cost_complexity = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)

tune_res_red <- tune_grid(
  rclass_tree_wf,
  resamples = red_fold,
  grid = param_grid,
  metric = metric_set(accuracy)
)

autoplot(tune_res_red)

# extracting the best cost complexitiy parameter
best_complexity <- select_best(tune_res_red)

rclass_tree_final <- finalize_workflow(rclass_tree_wf, best_complexity)

rclass_tree_final_fit <- fit(rclass_tree_final, data = red_train)

rclass_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
augment(rclass_tree_final_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

augment(rclass_tree_final_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)

# augmented on testing 
augment(rclass_tree_final_fit, new_data = red_test) %>%
  conf_mat(truth = quality, estimate = .pred_class) 

augment(rclass_tree_final_fit, new_data = red_test) %>%
  accuracy(truth = quality, estimate = .pred_class)

random forest transition

# setting random forest model up
rrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")

rrandfor_wf <- workflow() %>%
  add_model(rrandfor %>%
              set_args(mtry = tune(), trees = tune(), min_n = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

# tuning the model to find the best arguments 
param_grid2 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)

rtune_res_randfor <- tune_grid(
  rrandfor_wf,
  resamples = red_fold,
  grid = param_grid2
)

rAutoPlotRF <- autoplot(rtune_res_randfor)

# collecting metrics to find best mean
rbest_rocauc1 <- collect_metrics(rtune_res_randfor) %>% arrange(desc(mean))
print(rbest_rocauc1)
rbest_metric1 <- select_best(rtune_res_randfor)

rrandfor_final <- rand_forest(mtry = 7, trees = 17, min_n = 32) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_fit_final <- fit(rrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,data = red_train)

rVIP <- vip(rrandfor_fit_final)
rVIP

# saving 
save(rAutoPlotRF, rbest_rocauc1, rrandfor_final, rVIP, rrandfor_fit_final, file = "WhiteWineRandomForest.rda")

#load saved data
#load()

boosted tree transition

# Red Boosted Trees 
rboost_spec <-  boost_tree(tree_depth = 5) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

rboost_wf <- workflow() %>%
  add_model(rboost_spec %>%
              set_args(trees = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

param_grid4 <- grid_regular(trees(range = c(10,2000)), levels = 10)

rtune_res_boosted <- tune_grid(
  rboost_wf,
  resamples = red_fold,
  grid = param_grid4
)
rBoostedAutoPlot <- autoplot(rtune_res_boosted)
rBoostedAutoPlot

rbest_rocauc2 <- collect_metrics(rtune_res_boosted) %>% arrange(desc(mean))
print(rbest_rocauc2)

rbest_metric2 <- select_best(rtune_res_boosted)
print(rbest_metric2)

rboost_final <- boost_tree(tree_depth = 5, trees = 231)%>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
rboost_fit_final <- fit(rboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)

rpredicted <- augment(rboost_fit_final, new_data = red_train) 
rboosted_acc <- rpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc <-  rpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurve <- rpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrix <- rpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

rpredictedtest <- augment(rboost_fit_final, new_data = red_test) 
rboosted_acc_test <- rpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc_test <- rpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurveTesting <- rpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrixTesting <- rpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

best fit transition

#RED
rbest_rocauc
rbest_rocauc1
rbest_rocauc2
rbest_roc_table <- rbind(rbest_rocauc[1,c(2,4)], rbest_rocauc1[1,c(4,6)], rbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
rbest_roc_table

# We will test our model using Boosted Trees and Random Forest 
red_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())

# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(red_recipe_pca$template))

# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)

rlda_pca_wkflow <- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe_pca)

rpca_lda_fit <- rlda_pca_wkflow %>%
  tune_grid(resamples = red_fold,
            grid = tuneGrid,
            metrics = metric_set(accuracy),
            control = trControl)

rpca_lda_metrics <- rpca_lda_fit %>% collect_metrics()

ggplot(rpca_lda_metrics, aes(x = num_comp, y = mean)) +
  geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
  scale_x_continuous(breaks = 1:ncol(red_recipe_pca$template)) +
  facet_wrap(~.metric) +
  theme_bw()

rpca_lda_fit %>% show_best(metric = "accuracy")

(bestTune <- rpca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))

rlda_pca_wkflow_final <- rlda_pca_wkflow %>% finalize_workflow(bestTune)

rfit_final <- rlda_pca_wkflow_final %>% fit(red_train)

red.PCALDA <- tibble(red_train,
       predict(rfit_final, new_data =red_train, type = "class"), # predicted class
       predict(rfit_final, new_data = red_train, type = "prob"), # posterior prob. for classes
       as_tibble(predict(rfit_final, new_data = red_train, type = "raw")$x)) # LD scores

# plot
ggplot(red.PCALDA, aes(x = LD1, y = LD2)) +
  geom_point(aes(color = quality, shape = .pred_class)) + 
  theme_bw() +
  ggtitle("PCA-LDA (DAPC) on Red Wine Training dataset, using 9 PC")

#augmented on training 
rpcalda_fit <- augment(rfit_final, new_data = red_train) 
rpcalda_acc <- rpcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_rocauc <- rpcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
rpcalda_roccurve <- rpcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_confusionmatrix <- rpcalda_fit %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

rpcalda_acc
rpcalda_rocauc
rpcalda_roccurve
rpcalda_confusionmatrix

final testing transition

# lda using pca test
rpcalda_fit_test <- augment(rfit_final, new_data = red_test) 
rpcalda_test_acc <- rpcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_test_rocauc <- rpcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "Red Wine LDA Model using PCA")
rpcalda_test_roccurve <- rpcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_test_confusionmatrix <- rpcalda_fit_test %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

rpcalda_test_acc
rpcalda_test_rocauc
rpcalda_test_roccurve
rpcalda_test_confusionmatrix

red conclusion

Final thoughts